:orphan:

Core Basics 2: Train a Classifier on a Star Multi-Table Dataset
===============================================================

In this notebook we learn how to train a classifier on multi-table
data composed of two tables (a root table and a secondary table). We
highly recommend going through the *Core Basics 1* lesson first if you
are not familiar with Khiops.

Make sure you have installed `Khiops <https://khiops.org/setup/>`__ and
`Khiops Visualization <https://khiops.org/setup/visualization/>`__.

We start by importing Khiops, checking its installation and defining
some helper functions:

.. code:: ipython3

    import os
    import platform
    import subprocess
    from khiops import core as kh
    
    # Define peek helper function
    def peek(file_path, n=10):
        """Shows the first n lines of a file"""
        with open(file_path, encoding="utf8", errors="replace") as file:
            # Iterate lazily so that only the first n lines are read
            for _, line in zip(range(n), file):
                print(line, end="")
        print("")
    
    
    # If there are any issues you may print the Khiops status with the following command
    # kh.get_runner().print_status()

Training a Multi-Table Classifier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We’ll train a “sarcasm detector” using the dataset ``HeadlineSarcasm``.
In its raw form, it contains a list of text headlines paired with a
label that indicates whether its source is a sarcastic site (such as
`The Onion <https://www.theonion.com>`__) or not.

We have transformed this dataset into two tables such that the
text-label record

::

   "groundbreaking study finds gratification can be deliberately postponed"    yes

is transformed to an entry in a table that contains id-label records

::

   97 yes

and various entries in a secondary table linking a headline id to its
words and positions

::

   97  0   groundbreaking
   97  1   study
   97  2   finds
   97  3   gratification
   97  4   can
   97  5   be
   97  6   deliberately
   97  7   postponed

Thus the ``HeadlineSarcasm`` dataset has the following multi-table
schema

::

    +-----------+
    |Headline   |
    +-----------+ +-------------+
    |HeadlineId*| |HeadlineWords|
    |IsSarcasm  | +-------------+
    +-----------+ |HeadlineId*  |
         |        |Position     |
         +-1:n--->|Word         |
                  +-------------+

The ``HeadlineId`` variable is special because it is a *key* that links
a particular headline to its words (a 1:n relation).
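
The transformation described above can be sketched in a few lines of
plain Python. This is only an illustration: the helper
``split_headline`` is a hypothetical name, and splitting on whitespace
is an assumption about how the words were obtained, not necessarily the
procedure used to build the dataset.

.. code:: ipython3

    def split_headline(headline_id, text, label):
        """Splits a text-label record into one main row and its word rows"""
        # One row for the main table: (HeadlineId, IsSarcasm)
        main_row = (headline_id, label)
        # One row per word for the secondary table: (HeadlineId, Position, Word)
        word_rows = [
            (headline_id, position, word)
            for position, word in enumerate(text.split())
        ]
        return main_row, word_rows


    main_row, word_rows = split_headline(
        97,
        "groundbreaking study finds gratification can be deliberately postponed",
        "yes",
    )
    print(main_row)
    for row in word_rows:
        print(row)

Every word row repeats the ``HeadlineId`` of its headline, which is what
allows the two tables to be linked back together.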

*Note: There are other methods more appropriate for this text-mining
problem. This multi-table setup is only for pedagogical purposes.*

To train a classifier with Khiops in this multi-table setup, this schema
must be codified in the dictionary file. Let’s check the contents of the
``HeadlineSarcasm`` dictionary file:

.. code:: ipython3

    sarcasm_kdic = os.path.join("data", "HeadlineSarcasm", "HeadlineSarcasm.kdic")
    
    print(f"HeadlineSarcasm dictionary file: {sarcasm_kdic}")
    print("")
    peek(sarcasm_kdic, n=15)


.. parsed-literal::

    HeadlineSarcasm dictionary file: data/HeadlineSarcasm/HeadlineSarcasm.kdic
    
    Root Dictionary Headline(HeadlineId)
    {
      Categorical HeadlineId;
      Categorical IsSarcasm;
      Table(Words) HeadlineWords;
    };
    
    Dictionary Words(HeadlineId)
    {
      Categorical HeadlineId;
      Numerical Position;
      Categorical Word;
    };
    
    
As in the single-table case, the ``.kdic`` file describes the schema of
the data, here for both tables. Note the following differences:

- The dictionary for the table ``Headline`` is prefixed by the ``Root``
  keyword to indicate that it is the main one.
- For both tables, the dictionary names are followed by
  ``(HeadlineId)`` to indicate that ``HeadlineId`` is the key of these
  tables.
- The schema for the main table contains an extra special variable
  defined with the statement ``Table(Words) HeadlineWords``. This
  variable, in addition to the shared key variable, is necessary to
  indicate the ``1:n`` relationship between the main and secondary
  tables.
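
To make these three differences concrete, the following sketch extracts
them from the dictionary text shown above using regular expressions. It
is a toy parser written for this snippet only (the names
``summarize_kdic``, ``HEADER`` and ``TABLE_VARIABLE`` are made up here)
and it is not how Khiops itself reads dictionary files.

.. code:: ipython3

    import re

    # Dictionary text copied from the output above
    KDIC_TEXT = """
    Root Dictionary Headline(HeadlineId)
    {
      Categorical HeadlineId;
      Categorical IsSarcasm;
      Table(Words) HeadlineWords;
    };

    Dictionary Words(HeadlineId)
    {
      Categorical HeadlineId;
      Numerical Position;
      Categorical Word;
    };
    """

    # Matches "[Root] Dictionary <Name>(<key fields>)"
    HEADER = re.compile(
        r"^[ \t]*(Root[ \t]+)?Dictionary[ \t]+(\w+)\(([^)]*)\)", re.MULTILINE
    )
    # Matches "Table(<DictionaryName>) <VariableName>;"
    TABLE_VARIABLE = re.compile(r"^[ \t]*Table\((\w+)\)[ \t]+(\w+);", re.MULTILINE)


    def summarize_kdic(text):
        """Returns the dictionary headers and the Table variables found in text"""
        headers = [
            {
                "name": match.group(2),
                "root": match.group(1) is not None,
                "key": [field.strip() for field in match.group(3).split(",")],
            }
            for match in HEADER.finditer(text)
        ]
        # Each link is (Table variable name, name of the dictionary it points to)
        links = [
            (match.group(2), match.group(1))
            for match in TABLE_VARIABLE.finditer(text)
        ]
        return headers, links


    headers, links = summarize_kdic(KDIC_TEXT)
    for header in headers:
        print(header)
    print(links)

The summary shows one root dictionary, the same key on both
dictionaries, and a single ``Table`` variable linking them.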

Now let’s store the locations of the main and secondary tables and peek
at their contents:

.. code:: ipython3

    sarcasm_headlines_file = os.path.join("data", "HeadlineSarcasm", "Headlines.txt")
    sarcasm_words_file = os.path.join("data", "HeadlineSarcasm", "HeadlineWords.txt")
    
    print(f"HeadlineSarcasm main table file: {sarcasm_headlines_file}")
    print("")
    peek(sarcasm_headlines_file, n=3)
    
    print(f"HeadlineSarcasm secondary table file location: {sarcasm_words_file}")
    print("")
    peek(sarcasm_words_file, n=15)


.. parsed-literal::

    HeadlineSarcasm main table file: data/HeadlineSarcasm/Headlines.txt
    
    HeadlineId	IsSarcasm
    0	yes
    1	no
    
    HeadlineSarcasm secondary table file location: data/HeadlineSarcasm/HeadlineWords.txt
    
    HeadlineId	Position	Word
    0	0	thirtysomething
    0	1	scientists
    0	2	unveil
    0	3	doomsday
    0	4	clock
    0	5	of
    0	6	hair
    0	7	loss
    1	0	dem
    1	1	rep.
    1	2	totally
    1	3	nails
    1	4	why
    1	5	congress
    

The call to ``train_predictor`` is very similar to the single-table
case, but there are some differences.

The first is that we must pass the path of the extra secondary data
table. This is done with the ``additional_data_tables`` parameter, which
is a Python dictionary containing one key-value pair for each table.
More precisely:

- keys describe the *data paths* of secondary tables. In this case only
  :literal:`Headline`HeadlineWords`
- values describe the *file paths* of secondary tables. In this case
  only the file path we stored in ``sarcasm_words_file``

*Note: To understand what data paths are, see the “Multi-Table Tasks”
section of the Khiops* ``core.api`` *documentation.*
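
Judging from the example above, the data path of a secondary table is
the name of the root dictionary (``Headline``) joined by a backtick to
the name of the ``Table`` variable that points to it (``HeadlineWords``,
not the dictionary name ``Words``). The sketch below builds the
parameter under that assumption. The helper ``make_data_path`` is
hypothetical, and the Khiops documentation remains the reference for the
exact data path rules.

.. code:: ipython3

    import os


    def make_data_path(root_dictionary_name, table_variable_name):
        """Joins a root dictionary name and a Table variable name with a backtick"""
        return f"{root_dictionary_name}`{table_variable_name}"


    words_file = os.path.join("data", "HeadlineSarcasm", "HeadlineWords.txt")
    additional_data_tables = {make_data_path("Headline", "HeadlineWords"): words_file}
    print(additional_data_tables)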

Secondly, we specify how many features/aggregates Khiops will create
with its multi-table AutoML mode. For the ``HeadlineSarcasm`` dataset
Khiops can create features such as:

- *Number of different words in the headline*
- *Most common word in the headline before the third one*
- *Number of times the word ‘the’ appears*
- …

It will then evaluate, select and combine the created features to build
a classifier. We’ll ask it to create ``1000`` of these features (the
default is ``100``).
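
To get a feeling for what such aggregates look like, here is a
hand-written sketch that computes a few of them from the first rows of
the secondary table shown earlier. Khiops constructs and selects its
features automatically, so nothing like this has to be written in
practice. The function ``aggregate_words`` and the feature names are
invented for this illustration.

.. code:: ipython3

    from collections import defaultdict

    # (HeadlineId, Position, Word) rows copied from the peek output above;
    # headline 1 is cut off there, so only its first six words appear
    rows = [
        (0, 0, "thirtysomething"), (0, 1, "scientists"), (0, 2, "unveil"),
        (0, 3, "doomsday"), (0, 4, "clock"), (0, 5, "of"),
        (0, 6, "hair"), (0, 7, "loss"),
        (1, 0, "dem"), (1, 1, "rep."), (1, 2, "totally"),
        (1, 3, "nails"), (1, 4, "why"), (1, 5, "congress"),
    ]


    def aggregate_words(rows, counted_word="the"):
        """Computes simple per-headline aggregates from the word rows"""
        # Group the words of each headline by its key
        words_by_id = defaultdict(list)
        for headline_id, _position, word in rows:
            words_by_id[headline_id].append(word)
        # One record of aggregates per headline
        return {
            headline_id: {
                "n_words": len(words),
                "n_distinct_words": len(set(words)),
                "n_counted_word": words.count(counted_word),
            }
            for headline_id, words in words_by_id.items()
        }


    features = aggregate_words(rows, counted_word="of")
    for headline_id, values in features.items():
        print(headline_id, values)

Each headline ends up with a single record of aggregates, which is the
shape a classifier trained on the main table needs.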

With these considerations, let’s set up some extra variables and train
the classifier:

.. code:: ipython3

    sarcasm_results_dir = os.path.join("exercises", "HeadlineSarcasm")
    
    sarcasm_report, sarcasm_model_kdic = kh.train_predictor(
        sarcasm_kdic,
        dictionary_name="Headline",  # This must be the main/root dictionary
        data_table_path=sarcasm_headlines_file,  # This must be the data file for the main table
        target_variable="IsSarcasm",
        results_dir=sarcasm_results_dir,
        additional_data_tables={"Headline`HeadlineWords": sarcasm_words_file},
        max_constructed_variables=1000,  # by default Khiops constructs 100 variables for AutoML multi-table
        max_trees=0,  # by default Khiops constructs 10 decision tree variables
    )
    print(f"HeadlineSarcasm report file located at: {sarcasm_report}")
    print(f"HeadlineSarcasm modeling dictionary file located at: {sarcasm_model_kdic}")


.. parsed-literal::

    HeadlineSarcasm report file located at: exercises/HeadlineSarcasm/AllReports.khj
    HeadlineSarcasm modeling dictionary file located at: exercises/HeadlineSarcasm/Modeling.kdic


We can now take a look at the results with the visualization tool:

.. code:: ipython3

    # To visualize uncomment the line below
    # kh.visualize_report(sarcasm_report)

*Note: In the multi-table case, the input tables must be sorted by their
key column in lexicographical order. To do this you may use the Khiops*
``sort_data_table`` *function or your favorite software. The examples of
this tutorial have their tables pre-sorted.*
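
Lexicographical order means that keys are compared as text, so ``10``
comes before ``2``. The sketch below checks this property on a list of
data lines. It assumes that the key is the first tab-separated field
(as in the tables of this tutorial), that the header line has already
been skipped, and that Python’s string comparison matches the order
Khiops expects. ``is_sorted_by_key`` is a hypothetical helper, not part
of Khiops.

.. code:: ipython3

    def is_sorted_by_key(lines, separator="\t"):
        """Checks that the first field of each line is in non-decreasing text order"""
        previous_key = None
        for line in lines:
            key = line.rstrip("\n").split(separator)[0]
            # Keys are compared as strings, not as numbers
            if previous_key is not None and key < previous_key:
                return False
            previous_key = key
        return True


    print(is_sorted_by_key(["1\tno", "10\tyes", "2\tno"]))
    print(is_sorted_by_key(["1\tno", "2\tno", "10\tyes"]))

The second list is sorted numerically but not lexicographically, so the
check rejects it.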

Exercise time!
~~~~~~~~~~~~~~

Repeat the previous steps with the ``AccidentsSummary`` dataset. It
describes the characteristics of traffic accidents that happened in
France in 2018. It has two tables with the following schema:

::

   +---------------+
   |Accidents      |
   +---------------+
   |AccidentId*    |
   |Gravity        |
   |Date           |
   |Hour           | +---------------+
   |Light          | |Vehicles       |
   |Department     | +---------------+
   |Commune        | |AccidentId*    |
   |InAgglomeration| |VehicleId*     |
   |...            | |Direction      |
   +---------------+ |Category       |
          |          |PassengerNumber|
          +---1:n--->|...            |
                     +---------------+

So for each accident we have its characteristics (such as ``Gravity`` or
``Light`` conditions) and those of each involved vehicle (its
``Direction`` or ``PassengerNumber``). The main task for this dataset is
to predict the variable ``Gravity``, which has two possible values:
``Lethal`` and ``NonLethal``.

We first save the paths of the ``AccidentsSummary`` dictionary file and
data table files into variables:

.. code:: ipython3

    accidents_kdic = os.path.join(
        kh.get_samples_dir(), "AccidentsSummary", "Accidents.kdic"
    )
    accidents_data_file = os.path.join(
        kh.get_samples_dir(), "AccidentsSummary", "Accidents.txt"
    )
    vehicles_data_file = os.path.join(
        kh.get_samples_dir(), "AccidentsSummary", "Vehicles.txt"
    )

Print the file locations and use the function ``peek`` to list their contents
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Which table is the ``Root`` in this case?

.. code:: ipython3

    print(f"Accidents dictionary file: {accidents_kdic}")
    print("")
    peek(accidents_kdic, n=40)
    
    print(f"Accidents (main) data table: {accidents_data_file}")
    print("")
    peek(accidents_data_file)
    
    print(f"Vehicles data table: {vehicles_data_file}")
    print("")
    peek(vehicles_data_file)


.. parsed-literal::

    Accidents dictionary file: /github/home/khiops_data/samples/AccidentsSummary/Accidents.kdic
    
    Root Dictionary Accident(AccidentId)
    {
      Categorical AccidentId;
      Categorical	Gravity;
      Date Date;
      Time Hour;
      Categorical Light;
      Categorical Department;
      Categorical Commune;
      Categorical InAgglomeration;
      Categorical IntersectionType;
      Categorical Weather;
      Categorical CollisionType;
      Categorical PostalAddress;
      Table(Vehicle) Vehicles;
    };
    
    Dictionary Vehicle(AccidentId, VehicleId)
    {
     Categorical AccidentId;
     Categorical VehicleId;
     Categorical Direction;
     Categorical Category;
     Numerical PassengerNumber;
     Categorical FixedObstacle;
     Categorical MobileObstacle;
     Categorical ImpactPoint;
     Categorical Maneuver;
    };
    
    Accidents (main) data table: /github/home/khiops_data/samples/AccidentsSummary/Accidents.txt
    
    AccidentId	Gravity	Date	Hour	Light	Department	Commune	InAgglomeration	IntersectionType	Weather	CollisionType	PostalAddress
    201800000001	NonLethal	2018-01-24	15:05:00	Daylight	590	005	No	Y-type	Normal	2Vehicles-BehindVehicles-Frontal	route des Ansereuilles
    201800000002	NonLethal	2018-02-12	10:15:00	Daylight	590	011	Yes	Square	VeryGood	NoCollision	Place du général de Gaul
    201800000003	NonLethal	2018-03-04	11:35:00	Daylight	590	477	Yes	T-type	Normal	NoCollision	Rue  nationale
    201800000004	NonLethal	2018-05-05	17:35:00	Daylight	590	052	Yes	NoIntersection	VeryGood	2Vehicles-Side	30 rue Jules Guesde
    201800000005	NonLethal	2018-06-26	16:05:00	Daylight	590	477	Yes	NoIntersection	Normal	2Vehicles-Side	72 rue Victor Hugo
    201800000006	NonLethal	2018-09-23	06:30:00	TwilightOrDawn	590	052	Yes	NoIntersection	LightRain	Other	D39
    201800000007	NonLethal	2018-09-26	00:40:00	NightStreelightsOn	590	133	Yes	NoIntersection	Normal	Other	4 route de camphin
    201800000008	Lethal	2018-11-30	17:15:00	NightStreelightsOn	590	011	Yes	NoIntersection	Normal	Other	rue saint exupéry
    201800000009	NonLethal	2018-02-18	15:57:00	Daylight	590	550	No	NoIntersection	Normal	Other	rue de l'égalité
    
    Vehicles data table: /github/home/khiops_data/samples/AccidentsSummary/Vehicles.txt
    
    AccidentId	VehicleId	Direction	Category	PassengerNumber	FixedObstacle	MobileObstacle	ImpactPoint	Maneuver
    201800000001	A01	Unknown	Car<=3.5T	0	None	Vehicle	RightFront	TurnToLeft
    201800000001	B01	Unknown	Car<=3.5T	0	None	Vehicle	LeftFront	NoDirectionChange
    201800000002	A01	Unknown	Car<=3.5T	0	None	Pedestrian	None	NoDirectionChange
    201800000003	A01	Unknown	Motorbike>125cm3	0	StationaryVehicle	Vehicle	Front	NoDirectionChange
    201800000003	B01	Unknown	Car<=3.5T	0	None	Vehicle	LeftSide	TurnToLeft
    201800000003	C01	Unknown	Car<=3.5T	0	None	None	RightSide	Parked
    201800000004	A01	Unknown	Car<=3.5T	0	None	Other	RightFront	Avoidance
    201800000004	B01	Unknown	Bicycle	0	None	Vehicle	LeftSide	None
    201800000005	A01	Unknown	Moped	0	None	Vehicle	RightFront	PassLeft
    

We now save the results directory for this exercise:

.. code:: ipython3

    accidents_results_dir = os.path.join("exercises", "AccidentSummary")
    print(f"AccidentsSummary exercise results directory: {accidents_results_dir}")


.. parsed-literal::

    AccidentsSummary exercise results directory: exercises/AccidentSummary


Train a classifier for the ``Accidents`` database with 1000 variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Save the resulting file locations into the variables
``accidents_report`` and ``accidents_model_kdic`` and print them.

Do not forget:

- The target variable is ``Gravity``
- The key for the ``additional_data_tables`` parameter is
  :literal:`Accident`Vehicles` and its value is that of
  ``vehicles_data_file``
- Set ``max_trees=0``

.. code:: ipython3

    accidents_report, accidents_model_kdic = kh.train_predictor(
        accidents_kdic,
        dictionary_name="Accident",
        data_table_path=accidents_data_file,
        target_variable="Gravity",
        results_dir=accidents_results_dir,
        additional_data_tables={"Accident`Vehicles": vehicles_data_file},
        max_constructed_variables=1000,
        max_trees=0,
    )
    print(f"AccidentsSummary report file: {accidents_report}")
    print(f"AccidentsSummary modeling dictionary: {accidents_model_kdic}")


.. parsed-literal::

    AccidentsSummary report file: exercises/AccidentSummary/AllReports.khj
    AccidentsSummary modeling dictionary: exercises/AccidentSummary/Modeling.kdic


Take a look at the report
^^^^^^^^^^^^^^^^^^^^^^^^^

Which variables best predict the gravity of an accident?

.. code:: ipython3

    # To visualize uncomment the line below
    # kh.visualize_report(accidents_report)